1. INTRODUCTION

Customer segmentation is the process of classifying customers into separate categories based on
shared characteristics [1], [2]. It gives a company the understanding of its customers that it needs,
helps to identify prospective customers [3], and groups customers with similar characteristics.
One form is behavior-based segmentation, for which the most commonly used model is recency,
frequency, and monetary [4]. Customer behavior has been researched extensively in retail marketing
and retail decision-making [5], and for the past two decades this model has been used to categorize
customer databases according to purchasing behavior [6].

Hughes introduced the recency, frequency, and monetary (RFM) model in 1994 as a means of
analyzing customer behavior [7]. The model uses a customer’s recency (the interval since their most
recent purchase), frequency (the number of purchases they make in a particular time period), and monetary
value (the amount they spend on each transaction) to determine their value [2]. As a result, businesses can
determine which customers are worth engaging and retaining, since the model is an effective way of
predicting customers’ future purchasing behavior [8].

212 mart is a retail business that focuses only on the monetary aspect of customer
segmentation. It gave special treatment only to customers with a high monetary value, that is,
customers whose transactions were large in value. It made no use of recency and frequency,
which made it ineffective at identifying prospective customers. A recency variable therefore needed
to be added to provide information about the interval between a customer’s latest transaction and the
time of analysis, and a frequency variable was needed because customers with a high transaction
frequency show greater loyalty. With all three RFM variables, prospective customers can be identified
effectively, and the results can be used to develop an effective marketing strategy [8]. In the RFM
model, customers are categorized into four types based on the average RFM values: prospective
customers, new customers, lost customers, and loyal customers [9].

Aside from RFM, customers can also be segmented by demographics, which is the most common
form of market segmentation and the easiest to understand. Demographic information is easy to
collect, interpret, and transfer from one study to another [10]. The demographic variables that are
commonly used are age, gender, family size and type, income, occupation, education level, race, and
nationality [11]. Demographics are statistics that describe the customer population, and they are also
used in marketing and in public opinion polls on trends [12].

Clustering is a data mining method that groups data into segments according to their
characteristics. Data with similar characteristics are placed in the same segment, while data with
dissimilar characteristics are placed in different segments [13]. One commonly used clustering
algorithm is density-based spatial clustering of applications with noise (DBSCAN).
The algorithm can find clusters of any shape at a given density [14], can handle large-scale data,
can detect outliers, and can cluster large data sets whose clusters differ in shape and size [15].


2. METHOD 

This research consisted of five steps. The first step was data preprocessing. The data were selected
based on the RFM criteria and then transformed into RFM values. After that, the data were normalized so
that the attribute scales would not differ too greatly, because the M value is a currency amount in
rupiah, unlike the R and F values. The values were normalized with the Min-Max method to a range of
0 to 1 [16]. In this study, RN denotes normalized recency, FN normalized frequency, and MN normalized
monetary value. The normalized value is calculated as follows:


$$v' = \frac{v_i - \min_A}{\max_A - \min_A} (\mathrm{newmax}_A - \mathrm{newmin}_A) + \mathrm{newmin}_A \qquad (1)$$
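
As an illustration of (1), the following is a minimal Python sketch of Min-Max normalization. The function name and the recency bounds (a minimum of 3 and a maximum of 365 days) are assumptions made for this example; the bounds are not stated in the text and were chosen only because they are consistent with the sample rows of Table 2 and Table 3, where customer 9 (R = 3) has RN = 0.000.

```python
def min_max_normalize(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Rescale value from [old_min, old_max] to [new_min, new_max], as in (1)."""
    if old_max == old_min:
        raise ValueError("old_max and old_min must differ")
    scaled = (value - old_min) / (old_max - old_min)
    return scaled * (new_max - new_min) + new_min


# Assumed recency bounds in days (illustrative, not reported in the study).
R_MIN, R_MAX = 3, 365

# Customer 1 in Table 2 has R = 241.
rn_customer_1 = min_max_normalize(241, R_MIN, R_MAX)
print(round(rn_customer_1, 3))
```

Under these assumed bounds the result rounds to 0.657, which agrees with the RN value listed for customer 1 in Table 3.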


The second step was clustering the data with the DBSCAN algorithm, which requires determining the
optimal values of epsilon and MinPts. For this purpose a k-dist graph was used, observing how the
epsilon value shifts for each k value. The points at which the k-dist graph changes sharply were
used as the epsilon values, while the k values were used as MinPts [15]. The third step was measuring
cluster quality using the silhouette index (SI). After the best clustering was obtained, the fourth
step was determining the rank symbol of each cluster from the average value of each RFM attribute in
that cluster. Finally, in the fifth step, an analysis based on the demographic variables was performed.
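
The k-dist values behind such a graph can be sketched in Python as follows. This is an illustrative sketch rather than the code used in the study (which used RStudio); the function name and the sample points are hypothetical, and the sketch assumes that a point is not counted as its own neighbor.

```python
import math


def k_distances(points, k):
    """Return every point's distance to its k-th nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to all other points (p itself is excluded).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


# Hypothetical data: four points on a unit square and one distant point.
sample_points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(k_distances(sample_points, k=2))
```

Plotting the sorted values gives the k-dist graph; the position where the curve rises sharply (here, the jump caused by the distant point) suggests a candidate epsilon.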


2.1. RFM model 

Hughes introduced the RFM model in 1994. It is a behavior-based customer segmentation
technique that evaluates a customer’s past behavior. It segments customers based on recency (the time
between their most recent purchase and the present), frequency (the number of transactions during a
specific time period), and monetary value (how much was spent on transactions) [8]. The RFM model
enables companies to easily assess customer loyalty towards their products and services, allowing
them to optimize their profits [17].
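
To make the three attributes concrete, the sketch below derives R, F, and M from a list of transactions in Python. The record layout, the function name, and the sample transactions are hypothetical and are not taken from the 212 mart data; the sketch also assumes that M is the total amount spent in the period and that R is counted in days up to a chosen analysis date.

```python
from datetime import date


def rfm_table(transactions, analysis_date):
    """Aggregate (customer_id, purchase_date, amount) rows into R, F, M per customer."""
    table = {}
    for customer_id, purchase_date, amount in transactions:
        last, freq, total = table.get(customer_id, (purchase_date, 0, 0))
        table[customer_id] = (max(last, purchase_date), freq + 1, total + amount)
    return {
        cid: {"R": (analysis_date - last).days, "F": freq, "M": total}
        for cid, (last, freq, total) in table.items()
    }


# Hypothetical transactions (amounts in rupiah).
sample = [
    ("A", date(2020, 1, 10), 50_000),
    ("A", date(2020, 12, 1), 75_000),
    ("B", date(2020, 6, 15), 120_000),
]
result = rfm_table(sample, analysis_date=date(2020, 12, 31))
print(result)
```

In this example customer A has a smaller R (a more recent purchase) and a higher F than customer B.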

To determine customer value from the RFM values, a cluster whose average RFM value is
higher than the overall average is denoted with ‘↑’, while a cluster with a lower value is denoted with ‘↓’.
A cluster denoted by R↑ F↑ M↑ is referred to as loyal customers, while one denoted by
R↓ F↓ M↓ is considered lost customers. Customers denoted by R↑ F↓ M↓ are new customers, and
those denoted by R↑ F↑ M↓ are prospective customers. Table 1 provides an explanation of customer
characteristics based on their RFM value [9].


2.2. DBSCAN clustering 

DBSCAN is a density-based clustering algorithm that groups data points in high-density regions
into clusters [18]. The algorithm is guided by two essential parameters, epsilon and MinPts. Epsilon
is the maximum radius of the neighborhood around a data point, while MinPts is the minimum number of
data points required within the epsilon radius to form a cluster [19].
The steps of the DBSCAN algorithm are as follows:
a) Randomly select a data point from the data set as the starting point, that is, the core point candidate.
b) Establish the values for epsilon and MinPts.
c) If the selected starting point qualifies as a core point under the user-defined
epsilon and MinPts values, a cluster is formed with its neighboring objects. The distance between
the core point and a neighboring object is measured with the Euclidean distance:


$$d_{xy} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (2)$$


d) If the starting point is a border point and no object is density-reachable from it, DBSCAN
visits the next object in the data set, which becomes the next core point candidate.

e) Repeat steps c and d until all points have been visited.

f) If the selected object qualifies as neither a core point nor a border point of a formed cluster, it
is treated as an outlier (noise): it lies farther than epsilon from every core point and has fewer
objects in its neighborhood than the specified MinPts.
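
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: points are visited in index order rather than at random, a point counts as its own neighbor when the neighborhood size is compared with MinPts, and the function names and sample points are hypothetical.

```python
import math

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (1, 2, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, so keep expanding
    return labels


# Hypothetical normalized points: two tight groups and one isolated point.
sample_points = [
    (0.0, 0.0), (0.0, 0.05), (0.05, 0.0),
    (0.5, 0.5), (0.5, 0.55), (0.55, 0.5),
    (0.9, 0.1),
]
print(dbscan(sample_points, eps=0.06, min_pts=3))
```

With these sample points the two tight groups form two clusters and the isolated point is labeled as noise.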


Table 1. The customer characteristics based on the RFM value

No  RFM symbol  Cluster name           Description
1   R↑ F↑ M↑    Loyal customers        Customers with a high average monetary value, meaning that they
                                       spend large amounts with the company. The high average frequency
                                       and recency indicate that they make transactions often.
2   R↓ F↓ M↓    Lost customers         Customers in this segment have low average values for monetary,
                                       frequency, and recency, indicating that they seldom make
                                       transactions and spend less money than the average customer.
3   R↑ F↓ M↓    New customers          Customers who have recently started purchasing, with a lower
                                       number of transactions and a lower total spending amount than
                                       the average.
4   R↑ F↑ M↓    Prospective customers  Customers with high average recency and frequency values but a
                                       low average monetary value. They are highly responsive, have
                                       purchased recently and quite often, and are therefore
                                       prospective customers for the company.


2.3. Silhouette index validation 

The silhouette index was first introduced by Rousseeuw in 1987. It combines intra-cluster
cohesion with the separation between clusters to evaluate cluster quality [20], [21], so that it
represents the separability of clusters and serves as a measure of cluster validity. The silhouette
index is useful when the data are on a ratio scale (Euclidean distance) and when looking for clearly
separated clusters [22]. It describes how well an object fits the cluster it occupies. The optimal
clustering has a high silhouette index value, close to 1. If the s(i) value is close to 1, the cluster
that contains object i is very dense; if the s(i) value is close to -1, the cluster that contains
object i is not dense [23]. The silhouette index value is computed as:


$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \qquad (3)$$
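
In (3), a(i) is the mean distance from object i to the other members of its own cluster, and b(i) is the smallest mean distance from object i to the members of any other cluster. A minimal Python sketch of the overall index is given below. The function name and sample data are hypothetical, and the sketch assumes that noise points are left out before the index is computed and that singleton clusters score 0; the text does not state how the study handled either case.

```python
import math


def silhouette_index(points, labels):
    """Mean s(i) over all points; assumes at least two clusters and no noise labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: two compact, well-separated clusters.
sample_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
sample_labels = [1, 1, 2, 2]
print(round(silhouette_index(sample_points, sample_labels), 3))
```

For this sample the index is about 0.9, reflecting compact clusters that lie far apart.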


3. RESULTS AND DISCUSSION 

The data used were the transaction data of customers who hold 212 mart member cards and
made transactions between January and December 2020. There were 1,205 customers who met these
criteria. The data were selected based on the RFM criteria and were normalized using (1) with a range
of 0 to 1. The results are shown in Table 2 and Table 3.

The next step was clustering the data using the DBSCAN algorithm. To determine the optimal
values of epsilon and MinPts, a k-dist graph was used. The k-dist values were computed at k = 3, k = 4,
and k = 5 using RStudio. The resulting k-dist graphs are shown in Figure 1, Figure 2, and Figure 3.

Based on Figure 1, Figure 2, and Figure 3, the sharp change occurs between 0.06 and 0.08.
Therefore, the candidate epsilon values lie in the range 0.06 to 0.08, with MinPts of 3, 4, and 5.
The DBSCAN results can be seen in Table 4. After the clusters were obtained, they were validated to
find the optimal number of clusters, as well as the quality and strength of the clusters for each
combination of epsilon and MinPts. The results of the cluster validation are in Table 5.
Table 2. The customer data in RFM model

Customer number    R      F     M (Rp)
1                  241    1     412,700
2                  27     10    1,241,100
3                  6      33    4,135,700
4                  6      14    2,738,200
5                  17     6     1,080,300
6                  333    2     289,500
7                  22     19    3,711,000
8                  29     18    5,385,200
9                  3      6     1,217,600
10                 217    4     919,100
...
1205               189    1     213,600

Table 3. The normalization data

Customer number    RN       FN       MN
1                  0.657    0.000    0.010
2                  0.066    0.066    0.031
3                  0.008    0.234    0.104
4                  0.008    0.095    0.069
5                  0.039    0.036    0.027
6                  0.912    0.007    0.007
7                  0.052    0.131    0.093
8                  0.072    0.124    0.135
9                  0.000    0.036    0.030
10                 0.591    0.022    0.023
...
1205               0.014    0.036    0.017

Table 4. The result of DBSCAN clustering

Epsilon    MinPts    Number of clusters    Noise
0.06       3         5                     31
0.06       4         2                     42
0.06       5         2                     45
0.07       3         4                     22
0.07       4         3                     32
0.07       5         2                     36
0.08       3         2                     18
0.08       4         2                     20
0.08       5         2                     27

Table 5. The results of silhouette index

Epsilon    MinPts    Number of clusters    SI value
0.06       3         5                     0.4222
0.06       4         2                     0.2947
0.06       5         2                     0.2951
0.07       3         4                     0.4145
0.07       4         3                     0.2818
0.07       5         2                     0.2747
0.08       3         2                     0.2901
0.08       4         2                     0.2896
0.08       5         2                     0.2781


Based on Table 5, the highest SI value, 0.4222, occurs at epsilon 0.06 and MinPts 3. As the value
closest to 1 among the tested combinations, it is taken as the optimal clustering. Epsilon 0.06 and
MinPts 3 produce 5 clusters with 31 noise points. The clusters consist of 1,118 customers in cluster 1,
7 customers in cluster 2, 14 customers in cluster 3, 9 customers in cluster 4, and 26 customers in
cluster 5.
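
The selection can be checked with a short Python sketch that copies the SI values from Table 5 and the cluster sizes reported above; the variable names are illustrative only.

```python
# (epsilon, MinPts) -> (number of clusters, SI value), copied from Table 5
si_results = {
    (0.06, 3): (5, 0.4222), (0.06, 4): (2, 0.2947), (0.06, 5): (2, 0.2951),
    (0.07, 3): (4, 0.4145), (0.07, 4): (3, 0.2818), (0.07, 5): (2, 0.2747),
    (0.08, 3): (2, 0.2901), (0.08, 4): (2, 0.2896), (0.08, 5): (2, 0.2781),
}

# Pick the parameter pair with the highest silhouette index.
best_params = max(si_results, key=lambda params: si_results[params][1])

# Cluster sizes and noise count for the selected clustering.
cluster_sizes = [1118, 7, 14, 9, 26]
noise_points = 31
total = sum(cluster_sizes) + noise_points

print(best_params, total)
```

The clustered customers (1,174) and the 31 noise points together account for all 1,205 customers in the data set.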

The next step was determining the rank symbol of each cluster. The average value of each RFM
attribute in a cluster was used to assign the rank symbol. A cluster whose average frequency or
monetary value was higher than the corresponding average before clustering was given the symbol ↑ for
that attribute, while a cluster whose average was lower than the average before clustering was given
the symbol ↓ [9]. Recency is treated in the opposite way: if the average recency of a cluster was
higher than the average before clustering, it was given the symbol ↓, and if it was lower than the
average before clustering, it was given the symbol ↑. This is because the shorter the interval between
the last purchase and the analysis period, the greater the recency rank [24]. Table 6 displays the RFM
average values prior to clustering, Table 7 displays the RFM average values of the best clustering,
and Table 8 displays the rank symbol findings for each cluster.
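
The ranking rule can be illustrated with the Python sketch below, which uses the averages reported in Table 6 and Table 7 and the segment names of Table 1. The function and variable names are hypothetical, and treating an average exactly equal to the overall average as ↓ is an assumption, since the text does not cover ties.

```python
OVERALL = {"R": 85.45, "F": 12, "M": 2_371_326}  # Table 6
CLUSTERS = {  # Table 7
    1: {"R": 83.93, "F": 13.67, "M": 1_780_118},
    2: {"R": 23.14, "F": 34.28, "M": 13_747_571},
    3: {"R": 13.50, "F": 41.64, "M": 14_729_150},
    4: {"R": 9.22, "F": 71.66, "M": 24_131_233},
    5: {"R": 28.07, "F": 34.80, "M": 10_585_200},
}
SEGMENTS = {  # Table 1
    ("↑", "↑", "↑"): "loyal customers",
    ("↓", "↓", "↓"): "lost customers",
    ("↑", "↓", "↓"): "new customers",
    ("↑", "↑", "↓"): "prospective customers",
}


def rank_symbols(avg, overall=OVERALL):
    """Return the (R, F, M) rank symbols of a cluster from its average values."""
    # Fewer days since the last purchase means a higher recency rank.
    r = "↑" if avg["R"] < overall["R"] else "↓"
    f = "↑" if avg["F"] > overall["F"] else "↓"
    m = "↑" if avg["M"] > overall["M"] else "↓"
    return (r, f, m)


for cluster, avg in CLUSTERS.items():
    symbols = rank_symbols(avg)
    print(cluster, " ".join(symbols), SEGMENTS.get(symbols, "unnamed pattern"))
```

Applied to the reported averages, the rule labels cluster 1 as prospective customers and clusters 2 to 5 as loyal customers.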


Table 6. The RFM before clustering

Average of R    Average of F    Average of M (Rp)
85.45           12              2,371,326


Table 7. The RFM average of each cluster

Cluster    R        F        M (Rp)
1          83.93    13.67    1,780,118
2          23.14    34.28    13,747,571
3          13.50    41.64    14,729,150
4          9.22     71.66    24,131,233
5          28.07    34.80    10,585,200


Table 8. The RFM rank symbol of each cluster

Cluster    Number of customers    RFM rank symbol
1          1,118                  R↑ F↑ M↓
2          7                      R↑ F↑ M↑
3          14                     R↑ F↑ M↑
4          9                      R↑ F↑ M↑
5          26                     R↑ F↑ M↑

Table 9. The demographic of cluster 1 (prospective customers)

Variable          Group                Frequency    %
Gender            Female               744          67%
                  Male                 374          33%
Age               18-24                71           6%
                  25-34                272          24%
                  35-44                378          34%
                  45-54                280          25%
                  55-64                99           9%
                  65 and above         18           2%
Occupation        Lecturer             312          28%
                  Teacher              96           9%
                  Private employee     262          23%
                  Entrepreneur         255          23%
                  College student      67           6%
                  Medical staff        36           3%
                  Civil servant        61           5%
                  Housewife            29           3%
Address           Pekanbaru            1067         95%
                  Outside Pekanbaru    51           5%
Marital status    Married              1028         92%
                  Single               90           8%

Table 10. The demographic of loyal customers clusters

Variable          Group                Frequency    %
Gender            Female               744          67%
                  Male                 374          33%
Age               18-24                71           6%
                  25-34                272          24%
                  35-44                378          34%
                  45-54                280          25%
                  55-64                99           9%
                  65 and above         18           2%
Occupation        Lecturer             312          28%
                  Teacher              96           9%
                  Private employee     262          23%
                  Entrepreneur         255          23%
                  College student      67           6%
                  Medical staff        36           3%
                  Civil servant        61           5%
                  Housewife            29           3%
Address           Pekanbaru            1067         95%
                  Outside Pekanbaru    51           5%
Marital status    Married              1028         92%
                  Single               90           8%


Based on Table 8, the customers in cluster 1 are categorized as prospective customers
(R↑ F↑ M↓). Customers in this group have above-average R and F ranks, which indicates that they have
shopped recently and do so frequently or repeatedly. The management of 212 mart can actively contact
these customers to offer new products, accompanied by promotional activities and various new gifts,
with the aim of increasing the customers’ interest in buying the products and increasing the amount
of money they spend.

Clusters 2, 3, 4, and 5 have the same rank symbol (R↑ F↑ M↑), which shows that the customers in
those clusters belong to the category of loyal customers. The customers in this segment are highly
retainable. The management of 212 mart must maintain these customers’ loyalty by regularly giving them
information on the latest products, by using their transactions to better understand their purchasing
behavior and needs, and by providing benefits that customers receive every time they make a
transaction above a certain value. The company can also enhance the reward program according to the
customers’ spending.

After the RFM rank of each cluster was obtained, the next task was the analysis of customer
segmentation based on demographic variables, namely gender, age, occupation, address, and marital
status. Table 9 and Table 10 show the customer demographics.

Based on Table 9 and Table 10, the majority of the customers fall in the prospective customers
category (cluster 1), which consists of 1,118 customers. Most of them are middle-aged, with an age
range of 35-44 years (34%); the group is dominated by women (744; 67%), the largest occupational
group is lecturers (312; 28%), and most live in Pekanbaru (95%) and are married (1,028; 92%).
Meanwhile, the loyal customers category (clusters 2, 3, 4, and 5) contains 56 people in total. Most of
them are 35-44 years old (30%), 42 of them are female (75%), most work as lecturers (41%) and live in
Pekanbaru (95%), and almost all of them are married (92%).


3.1. Discussion 

This study segmented the customers of 212 mart based on RFM and demographics using the
DBSCAN algorithm. Customer data were clustered into different segments based on the RFM variables.
Then, each cluster was given an RFM rank by comparing its average values with the overall averages,
and customers were categorized by their traits in accordance with the theory [9]. Two customer
segments were obtained, namely loyal customers and prospective customers. Both segments were then
analyzed based on demographics. The demographic variables used were age, gender, occupation, address,
and marital status. This analysis produces an understanding of the customers and proposes strategies
to be applied to each customer segment based on its characteristics.

Research on RFM and demographic-based customer segmentation had previously been carried
out [25] using customer data from five-star hotels in Antalya, Turkey. It differs from this study in
the algorithms used for clustering, namely the self-organizing map (SOM) and K-means, and in the
demographic variables used, namely gender, age, nationality, and travel companion. The findings
showed 8 clusters, with the majority of customers belonging to the “lost customers” segment, who stay
for a shorter time and are predominantly male. The results showed that RFM clusters customers
effectively, which might encourage senior managers to develop original ideas for improving their
customer relationship management (CRM) capabilities.


4. CONCLUSION 

This study shows that the 5 clusters obtained yield 2 categories of customers based on
customer characteristics, namely loyal customers and prospective customers. Customers in the loyal
category made repeated transactions and often spent large amounts of money. Customers in this
category are well worth retaining by providing the best service, so that they stay and are not won
over by other companies.

Based on the demographic analysis, the majority of 212 mart’s customers were middle-aged
(35-44; 34%), female (786; 67%), and married (1080; 92%), which suggests that the majority of
212 mart’s customers are housewives. This is also reflected in the products sold at 212 mart, which
are related to women or housewives. In addition, the most purchased items were groceries such as rice
and oil, as well as household items such as washing supplies and toiletries. The largest share of
212 mart’s customers worked as lecturers (335; 25%) and lived in Pekanbaru. This is because the target
customers of 212 mart are the upper middle class, such as lecturers, and because 212 mart is located
close to the campus.